Structural preferential attachment: Network organization beyond the link. 
We introduce a mechanism which models the emergence of the universal properties of complex networks, 
such as scale independence, modularity and self-similarity, and unifies them under a scale-free organization 
beyond the link. This brings a new perspective on network organization where communities, instead of links, 
are the fundamental building blocks of complex systems. We show how our simple model can reproduce social 
and information networks by predicting their community structure and, more importantly, how their nodes or 
communities are interconnected, often in a self-similar manner. 
A universal matter. Reducing complex systems to their simplest possible form while retaining their important properties helps model their behavior independently of their nature. Results obtained via these abstract models can then be transferred to other systems sharing a similar simplest form. Such groups of analogous systems are called universality classes and are the reason why some models apply just as well to the sizes of earthquakes or solar flares as to the sales numbers of books or music recordings [1]. That is, their statistical distributions can be reproduced by the same mechanism: preferential attachment. This mechanism has been of special interest to network science [2] because it models the emergence of power-law distributions for the number of links per node. This particular feature is one of the universal properties of network structure [3], alongside modularity [4] and self-similarity [5]. Previous studies have focused on those properties one at a time [3-8], yet a unified point of view is still wanting. In this Letter, we present an overarching model of preferential attachment that unifies the universal properties of network organization under a single principle.

Preferential attachment is one of the most ubiquitous mechanisms describing how elements are distributed within complex systems. More precisely, it predicts the emergence of scale-free (power-law) distributions where the probability $P_k$ of occurrence of an event of order $k$ decreases as an inverse power of $k$ (i.e., $P_k \propto k^{-\gamma}$ with $\gamma > 0$). It was initially introduced outside the realm of network science by Yule [9] as a mathematical model of evolution explaining the power-law distribution of biological genera by number of species. Independently, Gibrat [10] formulated a similar idea as a law governing the growth rate of incomes. Gibrat's law is the sole assumption behind preferential attachment: the growth rates of entities in a system are proportional to their size. Yet, preferential attachment is perhaps better described using Simon's general balls-in-bins process [11].

Simon's model was developed for the distribution of words by their frequency of occurrence in a prose sample [12]. The problem is the following: what is the probability $P_{k+1}(i+1)$ that the $(i+1)$-th word of a text is a word that has already appeared $k$ times? By simply stating that $P_{k+1}(i+1) \propto k \cdot P_k(i)$, Simon obtained the desired distribution [Fig. 1(a)]. In this model, the nature of the system is hidden behind a simple logic: the "popularity" of an event is encoded in its number of past occurrences. More clearly, a word used twice is twice as likely to reappear next as a word used once. However, before its initial occurrence, a word has appeared exactly zero times, yet it has a certain probability $p$ of appearing for the very first time. Simon's model thus produces systems whose distribution of elements falls as a power law of exponent $\gamma = (2-p)/(1-p)$.
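
Simon's rule can be sketched in a few lines of Python. This is only an illustrative simulation, not the code used for Fig. 1(a): the function names, the seed handling and the choice to start the text with a single word are assumptions made here. Copying a uniformly chosen past token is a standard way to pick a word with probability proportional to its number of occurrences.

```python
import random
from collections import Counter


def simon_model(num_words, p, seed=0):
    """Simulate Simon's process; return the occurrence count of each distinct word."""
    rng = random.Random(seed)
    text = [0]          # sequence of word ids; assumed start: one word used once
    n_distinct = 1
    for _ in range(num_words - 1):
        if rng.random() < p:
            # A brand-new word appears for the first time.
            text.append(n_distinct)
            n_distinct += 1
        else:
            # Repeat a past token chosen uniformly, i.e. a word chosen
            # with probability proportional to its past occurrences.
            text.append(rng.choice(text))
    counts = Counter(text)
    return [counts[word] for word in range(n_distinct)]


def simon_exponent(p):
    """Power-law exponent quoted in the text: (2 - p) / (1 - p)."""
    return (2.0 - p) / (1.0 - p)


if __name__ == "__main__":
    occurrences = simon_model(100_000, p=0.1, seed=42)
    histogram = Counter(occurrences)   # number of words used k times
    print("distinct words:", len(occurrences))
    print("words used once, twice, thrice:", histogram[1], histogram[2], histogram[3])
    print("expected exponent:", simon_exponent(0.1))
```

Plotting `histogram` on logarithmic axes should show an approximately straight tail whose slope can be compared with `simon_exponent(p)`; a single finite realization will fluctuate around it.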

On the matter of networks. Networks are ensembles of potentially linked elements called nodes. In the late 1990s, it was found that the distribution of links per node (the degree distribution) featured a power-law tail for networks of diverse nature. To model these so-called scale-free networks, Barabási and Albert [2] introduced preferential attachment in network science. In their model, nodes are added to the network and linked to a certain number of existing nodes. The probability that the new node chooses an old one of degree $k$ is proportional to $k \cdot N_k$, where $N_k$ is the number of nodes of degree $k$. As the system goes to infinity, $N_k$ falls off as $k^{-3}$.

From the perspective of complex networks, Simon's model may be regarded not as a scheme of throwing balls (e.g., word occurrences) in bins (e.g., unique words), but as an extreme case of scale-free networks where all links are shared within clearly divided structures. Obviously, both Simon's and the Barabási-Albert (BA) models follow the preferential attachment principle. However, Simon's model creates distinct growing structures, whereas the BA model creates overlapping links of fixed size. By using the same principle, one creates order while the other creates randomness [Fig. 1(b)]. Our approach explores the systems that lie in between.

When structure matters. The vast majority of natural networks have a modular topology where links are shared within dense subunits [4]. These structures, or communities, can be identified as social groups, industrial sectors, protein complexes or even semantic fields [13]. They typically overlap with each other by sharing nodes, and their number of neighboring structures is called their community degree. This particular topology is often referred to as community structure [Fig. 1(b)]. Because these structures are so important on a global level, they must influence local growth. Consequently, they are at the core of our model.

FIG. 1. (Color online) (a) The distribution of words by their number of appearances in James Joyce's Ulysses (empirical data). The numerical data was obtained from a single realization of Simon's model with $p$ equal to the ratio of unique words (30 030) to the total word count (267 350). (b) Schematization of the systems considered in this Letter, illustrating how order (Simon's model of balls in bins) and randomness (Barabási-Albert's model of random networks) coexist in a spectrum of complex systems. (c) The distribution of coactors and movies per actor in the Internet Movie Database since 2000. The organization moves closer to a true power law when looking at a higher structural level (i.e., movies versus coactors).

The use of preferential attachment at a higher structural level is motivated by three observations. First, the number of communities an element belongs to, its membership number, is often a better indicator of its activity level than its total degree. For instance, we judge an actor taking part in many small dramas more active than one cast in a single epic movie as one of a thousand extras, just as we may consider a protein that is part of many complexes more functional than one found in a single big complex.

Second, studies have hinted that Gibrat's law holds true for communities within social networks [14]. The power-law distribution of community sizes recently observed in many systems (e.g., protein interaction, word association and social networks [13] or metabolite and mobile phone networks [15]) supports this hypothesis.

Third, degree distributions can deviate significantly from true power laws, while higher structural levels might be better suited for preferential attachment models [Fig. 1(c)].

A simple model. Simon's model assigns elements to structures chosen proportionally to their sizes, while the BA model creates links between elements chosen proportionally to their degree. We thus define structural preferential attachment (SPA), where both elements and structures are chosen according to preferential attachment. Here, links will not be considered as a property of two given nodes, but as part of structures that can grow on the underlying space of nodes and eventually overlap.

Our model can be described as the following stochastic process. At every time step, a node joins a structure. The node is a new one with probability $q$, or an old one chosen proportionally to its membership number with probability $1 - q$. Moreover, the structure is a new one of size $s$ with probability $p$, or an old one chosen among existing structures proportionally to their size with probability $1 - p$. These two growth parameters are directly linked to two measurable properties: modularity ($p$) and connectedness ($q$) [Fig. 2]. Note that, at this point, no assumption is made on how nodes are linked within structures; our model focuses on the modular organization.

Whenever the structure is a new one, the remaining $s - 1$ elements involved in its creation are once again preferentially chosen among existing nodes. The basic structure size $s$ is called the system base and refers to the smallest structural unit of the system. It is not a parameter of the model per se, but depends on the considered system. For instance, the BA model directly creates links, i.e., $s = 2$ (with $p = q = 1$), unlike Simon's model which uses $s = 1$ (with $q = 1$). All the results presented here use a node-based representation ($s = 1$), although they can equally well be reproduced via a link-based representation ($s = 2$). In fact, for sufficiently large systems, the distinction between the two versions seems mainly conceptual (see Supplemental Material for details [16]).
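
The stochastic process described above can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `spa`, the initial condition (one structure holding $s$ nodes) and the handling of an old node drawn for a structure it already belongs to (the event is simply ignored) are choices made here and are not specified in the text.

```python
import random


def spa(steps, p, q, s=1, seed=0):
    """Simulate structural preferential attachment.

    Returns (memberships per node, size of each structure).
    """
    rng = random.Random(seed)
    # Assumed initial condition: a single structure containing s nodes.
    structures = [set(range(s))]
    # One entry per membership: uniform draws from these lists select a node
    # proportionally to its membership number, and a structure
    # proportionally to its size.
    node_tokens = list(range(s))
    struct_tokens = [0] * s
    n_nodes = s

    def join(node, struct):
        if node in structures[struct]:
            return False              # assumed rule: ignore duplicate memberships
        structures[struct].add(node)
        node_tokens.append(node)
        struct_tokens.append(struct)
        return True

    for _ in range(steps):
        # Node: new with probability q, else preferentially chosen old node.
        if rng.random() < q:
            node = n_nodes
            n_nodes += 1
        else:
            node = rng.choice(node_tokens)
        # Structure: new (size s) with probability p, else preferentially chosen.
        if rng.random() < p:
            structures.append(set())
            struct = len(structures) - 1
            join(node, struct)
            # The remaining s - 1 members are preferentially chosen old nodes.
            while len(structures[struct]) < s:
                join(rng.choice(node_tokens), struct)
        else:
            join(node, rng.choice(struct_tokens))

    memberships = [0] * n_nodes
    for node in node_tokens:
        memberships[node] += 1
    sizes = [len(members) for members in structures]
    return memberships, sizes


if __name__ == "__main__":
    memberships, sizes = spa(50_000, p=0.5, q=0.5, s=1, seed=7)
    print("nodes:", len(memberships), "structures:", len(sizes))
    print("largest membership number:", max(memberships))
    print("largest structure:", max(sizes))
```

With `s=1, q=1` every node has exactly one membership, which is the balls-in-bins limit, and with `s=2, p=q=1` every structure is a link attached preferentially, which is the BA-like limit with one link per new node.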

In our process, the growth of structures is not necessarily dependent on the growth of the network (i.e., the creation of nodes). Consequently, we can reproduce statistical properties of real networks without having to consider the large-size limit of the process. This allows our model to naturally include finite size effects (e.g., a distribution cutoff) and increases freedom in the scaling properties. In fact, we can follow $S_n$ and $N_m$, respectively the number of structures of size $n$ and of nodes with $m$ memberships, by writing master equations for their time evolution [11]:



$$\dot{S}_n(t) = (1-p)\,\frac{(n-1)\,S_{n-1}(t) - n\,S_n(t)}{[1+p(s-1)]\,t} + p\,\delta_{n,s} \qquad (1)$$

$$\dot{N}_m(t) = \left[1+p(s-1)-q\right]\frac{(m-1)\,N_{m-1}(t) - m\,N_m(t)}{[1+p(s-1)]\,t} + q\,\delta_{m,1} \qquad (2)$$

FIG. 2. (Color online) (top) Representation of the possible events in a step of node-based SPA; the probability of each event is indicated beneath it. (bottom) A schematization of the spectrum of systems obtainable with SPA. Here, we illustrate the conceptual differences between node-based ($s = 1$) and link-based ($s = 2$) systems: Simon's model ($q = 1$) creates structures of size one (nodes), while the BA model ($p = q = 1$) creates random networks through structures of size two (links).

Equations (1) and (2) can be transformed into ODEs for the evolution of the distribution of nodes per structure and structures per node by normalizing $S_n$ and $N_m$ by the total number of structures and nodes, $pt$ and $qt$, respectively. One then obtains recursively the following solutions for the normalized distributions at statistical equilibrium, $\{S_n^*\}$ and $\{N_m^*\}$:



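$$S_n^* = \frac{\prod_{k=s}^{n-1} k\,\Omega_s}{\prod_{k=s}^{n} \left(1 + k\,\Omega_s\right)} \quad \text{with} \quad \Omega_s = \frac{1-p}{1+p(s-1)} \qquad (3)$$

$$N_m^* = \frac{\prod_{k=1}^{m-1} k\,\Gamma_s}{\prod_{k=1}^{m} \left(1 + k\,\Gamma_s\right)} \quad \text{with} \quad \Gamma_s = \frac{1+p(s-1)-q}{1+p(s-1)} \qquad (4)$$

which scale as indicated in Table I: $N_m^* \propto m^{-\gamma_N}$ and $S_n^* \propto n^{-\gamma_S}$.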
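
The stationary distributions can also be evaluated numerically. The sketch below is our own reading of the stationarity condition of the master equations, not a formula quoted from the text: substituting $S_n(t) = pt\,S_n^*$ and $N_m(t) = qt\,N_m^*$ gives the recursions $S_n^* = [\omega\,(n-1)\,S_{n-1}^* + \delta_{n,s}]/(1 + n\,\omega)$ and $N_m^* = [g\,(m-1)\,N_{m-1}^* + \delta_{m,1}]/(1 + m\,g)$, where $\omega = (1-p)/[1+p(s-1)]$ and $g = [1+p(s-1)-q]/[1+p(s-1)]$. The tail exponents $1 + 1/\omega$ and $1 + 1/g$ follow from the ratio of successive terms. The function and variable names are labels chosen for this sketch.

```python
import math


def stationary_distributions(p, q, s=1, n_max=1000):
    """Iterate the assumed stationary recursions up to n_max.

    Returns (sizes, memberships): dicts mapping n -> S*_n and m -> N*_m.
    """
    total = 1.0 + p * (s - 1)         # memberships created per time step
    omega = (1.0 - p) / total         # share received by existing structures
    g = (total - q) / total           # share received by existing nodes

    sizes, prev = {}, 0.0
    for n in range(s, n_max + 1):
        source = 1.0 if n == s else 0.0    # new structures are born with size s
        prev = (omega * (n - 1) * prev + source) / (1.0 + n * omega)
        sizes[n] = prev

    memberships, prev = {}, 0.0
    for m in range(1, n_max + 1):
        source = 1.0 if m == 1 else 0.0    # new nodes are born with one membership
        prev = (g * (m - 1) * prev + source) / (1.0 + m * g)
        memberships[m] = prev
    return sizes, memberships


def scaling_exponents(p, q, s=1):
    """Tail exponents (size, membership) implied by the recursions above."""
    total = 1.0 + p * (s - 1)
    gamma_size = 1.0 + total / (1.0 - p) if p < 1.0 else math.inf
    gamma_memb = 1.0 + total / (total - q) if total > q else math.inf
    return gamma_size, gamma_memb


if __name__ == "__main__":
    # Link-based system with p = q = 1: membership exponent of the BA model.
    print(scaling_exponents(1.0, 1.0, s=2))
    # Node-based system with q = 1: size exponent (2 - p) / (1 - p) of Simon's model.
    print(scaling_exponents(0.5, 1.0, s=1))
```

Two limits stated in the text serve as sanity checks: the link-based case with $p = q = 1$ gives a membership exponent of 3, and the node-based case with $q = 1$ gives Simon's exponent $(2-p)/(1-p)$ for structure sizes.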

Results and discussions. There are three distributions of interest which can be directly obtained from SPA: the membership, the community size, and the community degree distributions. In systems such as the size of business firms or word frequencies, these distributions suffice to characterize the organization. To obtain them, the SPA parameters, $q$ and $p$, are fitted to the empirical scaling exponents of the membership and community size distributions. In complex networks, one may also be interested in the degree distribution. Additional assumptions are then needed to determine how nodes are interconnected within communities (specified when required).

TABLE I. Exponents of the power-law distributions of structures per element (membership) and of elements per structure (size) at statistical equilibrium. One easily verifies that the membership scaling of link-based systems with $p = q = 1$ corresponds to that of the BA model ($\gamma_N = 3$), and that node-based systems with $q = 1$ reproduce Simon's model. See Supplemental Material for the derivation [16].

The first set of results considered is the community structure of the coauthorship network of an electronic preprints archive, the cond-mat arXiv circa 2005 [Fig. 3(a)], whose topology was already characterized using a clique percolation method [13]. Here, the communities are detected using the link community algorithm of Ahn et al. [17], confirming previous results.

Using only two parameters, our model can create a system of similar size with an equivalent topology according to the four distributions considered (community sizes, memberships, community degree and node degree). Not only does SPA reproduce the correct density of structures of size 2, 3, 4 or more, but it also correctly predicts how these structures are interconnected via their overlap, i.e., the community degree. This is achieved without imposing any constraints whatsoever for this property. The first portion of the community degree distribution is approximately exponential, a behavior that can be observed in other systems, such as the Internet [Fig. 3(b)] and both a protein interaction and a word-association network [13]. To our knowledge, SPA is the first growth process to reproduce such community structured systems.

Moreover, assuming fully connected structures, SPA correctly produces a similar behavior in the degree distribution of the nodes. Obtaining this distribution alone previously required two parameters and additional assumptions [7]. In contrast, SPA shows that this is a signature of a scale-free community structure. This is an interesting result in itself, since most observed degree distributions follow a power law only asymptotically. Furthermore, this particular result also illustrates how self-similarity between different structural levels (i.e., node degree and community degree distributions) can emerge from the scale-free organization of communities.

FIG. 3. (Color online) Circles: distributions of topological quantities for (a) the cond-mat arXiv circa 2005; (b) Internet at the level of autonomous systems circa 2007; (c) the IMDb network for movies released since 2000. Solid lines: average over multiple realizations of the SPA process with (a) $p = 0.56$ and $q = 0.59$; (b) $p = 0.04$ and $q = 0.66$; (c) $p = 0.47$ and $q = 0.25$. For each realization, iterations are pursued until an equivalent system size is obtained. The Internet data highlights the transition between exponential and scale-free regimes in a typical community degree distribution. It is represented by a single realization of SPA (dots), because averaging masks the transition.

Finally, the Internet Movie Database coacting network is used to illustrate how, for bigger and sparser communities which cannot be considered fully connected, one can still easily approximate the degree distribution. We first observe that the mean density of links in communities of size $n$ approximately behaves as $\log(n)/n$ (see Supplemental Material [16]). Then, using a simple binomial approximation to connect the nodes within communities, it is possible to approximate the correct scaling behavior for the degree distribution [Fig. 3(c)]. This method takes advantage of the fact that communities are, by definition, homogeneous such that their internal organization can be considered random.
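
One possible reading of this binomial approximation is sketched below. The details are assumptions made for illustration only: the prefactor `c` of the density (taken as 1), the use of the natural logarithm, the cap of the density at 1, and the neglect of neighbors shared between overlapping communities are not specified in the text. Each community of size $n$ is taken to contribute a Binomial$(n-1, \rho_n)$ number of neighbors, with $\rho_n = \min[1, c\,\log(n)/n]$.

```python
import math
import random


def link_density(n, c=1.0):
    """Assumed mean link density in a community of size n: c * log(n) / n, capped at 1."""
    if n < 2:
        return 0.0
    return min(1.0, c * math.log(n) / n)


def mean_degree(community_sizes, c=1.0):
    """Expected degree of a node belonging to communities of the given sizes."""
    return sum((n - 1) * link_density(n, c) for n in community_sizes)


def sample_degree(community_sizes, rng, c=1.0):
    """Draw one degree: a Binomial(n - 1, density) contribution per community."""
    degree = 0
    for n in community_sizes:
        rho = link_density(n, c)
        # Each of the n - 1 other members is a neighbor with probability rho.
        degree += sum(1 for _ in range(n - 1) if rng.random() < rho)
    return degree


if __name__ == "__main__":
    rng = random.Random(0)
    sizes_of_my_communities = [120, 35, 8]    # hypothetical node in three communities
    print("expected degree:", mean_degree(sizes_of_my_communities))
    print("sampled degrees:", [sample_degree(sizes_of_my_communities, rng) for _ in range(5)])
```

Applying `sample_degree` to every node of a simulated community structure, using the sizes of the communities each node belongs to, yields an approximate degree distribution that can be compared with empirical data.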

Conclusion and perspective. In this Letter, we have developed a complex network organization model where connections are built through growing communities, whereas past efforts typically tried to arrange random links in a scale-free, modular and/or self-similar manner. Our model shows that these universal properties are a consequence of preferential attachment at the level of communities: the scale-free organization is inherited by the lower structural levels.

Looking at network organization beyond the link is also useful to account for missing links [18] or to help realistic modeling [19,20]. For instance, this new paradigm of scale-free community structure suggests that nodes with the most memberships, i.e., structural hubs, are key elements in propagating epidemics on social networks or viruses on the Internet. These structural hubs connect many different neighborhoods, unlike standard hubs whose links can be redundant if shared within a single community.

There is no denying that communities can interact in more complex ways through time [21]. Yet, from a statistical point of view, those processes can be neglected in the context of a structurally preferential growth. Similarly, even though other theories generating scale-free designs exist [22], they could also benefit from generalizing their point of view to higher levels of organization.